




Response to Reviewer # 1

Neural Information Processing Systems

The goal of this paper is to determine whether neural networks (NN) are equivalent to kernel methods. In this paper, we argue in the opposite direction, namely that NNs are superior to the neural tangent (NT) model. All of these generalizations are accessible to the same proof techniques developed in our paper. Before our paper, this problem was open even for quadratic activations. We report extensive experimental results comparing NN and NT for ReLU and tanh activations (see figure).
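As background for the NN-versus-NT comparison in the rebuttal: the NT model is the first-order (neural tangent) linearization of the network around its initialization. A minimal numpy sketch, assuming a two-layer network with quadratic activation and our own illustrative sizes (`d`, `m`, `W0`, `a` are not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 16  # input dimension and hidden width (illustrative choices)

W0 = rng.normal(size=(m, d)) / np.sqrt(d)   # first-layer weights at init
a = rng.normal(size=m) / np.sqrt(m)         # second layer, held fixed

def nn(W, x):
    """Two-layer network with quadratic activation: f(x) = a . (Wx)^2."""
    return a @ (W @ x) ** 2

def nt(W, x):
    """Neural tangent (NT) model: first-order Taylor expansion of f
    in the weights W around the initialization W0."""
    grad0 = 2 * (a * (W0 @ x))[:, None] * x[None, :]  # df/dW at W0
    return nn(W0, x) + np.sum(grad0 * (W - W0))

x = rng.normal(size=d)
W = W0 + 1e-4 * rng.normal(size=(m, d))  # small step away from init
# For small weight perturbations the NT model tracks the NN to
# second order; the gap grows once the weights move far from W0.
print(abs(nn(W, x) - nt(W, x)))
```

The NN/NT gap is exactly the second-order term `a @ ((W - W0) @ x) ** 2` here, which is why the two models coincide near initialization but can separate after training.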




When Do Neural Networks Outperform Kernel Methods?

Ghorbani, Behrooz, Mei, Song, Misiakiewicz, Theodor, Montanari, Andrea

arXiv.org Machine Learning

For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layer NNs are known to encode richer smoothness classes than RKHS, and we know of special examples for which SGD-trained NNs provably outperform RKHS. This is true even in the wide-network limit, for a different scaling of the initialization. How can we reconcile these claims? For which tasks do NNs outperform RKHS? If feature vectors are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the feature vectors display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present a model that captures in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test this hypothesis numerically by showing that specific perturbations of the training distribution degrade the performance of RKHS methods much more significantly than that of NNs.
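The setting the abstract describes can be sketched concretely: nearly isotropic high-dimensional features, a target that depends on only one direction, and an RKHS method (kernel ridge regression) fit on the full input. This is a hedged illustration with our own choices of dimensions, kernel, and regularization, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data with latent low-dimensional structure: the target
# depends on the d-dimensional input only through a 1-d projection.
d, n = 20, 200
w_star = np.zeros(d)
w_star[0] = 1.0                                # hidden signal direction
X = rng.normal(size=(n, d)) / np.sqrt(d)       # nearly isotropic features
y = np.tanh(np.sqrt(d) * (X @ w_star))         # low-dimensional target

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Kernel ridge regression: an RKHS method that sees all d dimensions
# and cannot by itself discover the single relevant direction.
lam = 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(n), y)

X_test = rng.normal(size=(50, d)) / np.sqrt(d)
y_test = np.tanh(np.sqrt(d) * (X_test @ w_star))
mse = float(np.mean((rbf_kernel(X_test, X) @ alpha - y_test) ** 2))
print(f"kernel ridge test MSE: {mse:.3f}")
```

A NN with trainable first-layer weights can instead learn `w_star` as a feature, which is the mechanism the abstract credits for NNs escaping the curse of dimensionality on isotropic inputs.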


Improving the Performance of Online Neural Transducer Models

Sainath, Tara N., Chiu, Chung-Cheng, Prabhavalkar, Rohit, Kannan, Anjuli, Wu, Yonghui, Nguyen, Patrick, Chen, Zhifeng

arXiv.org Machine Learning

ABSTRACT

Having a sequence-to-sequence model which can operate in an online fashion is important for streaming applications such as Voice Search. Neural transducer (NT) is a streaming sequence-to-sequence model, but has shown a significant degradation in performance compared to non-streaming models such as Listen, Attend and Spell (LAS). In this paper, we explore several improvements to NT. Specifically, we look at increasing the window over which NT computes attention, mainly by looking backwards in time so that the model still remains online. In addition, we explore initializing an NT model from a LAS-trained model so that it is guided with a better alignment. Finally, we explore including stronger language models, such as using wordpiece models and applying an external LM during the beam search. On a Voice Search task, we find that with these improvements we can get NT to match the performance of LAS.

1. INTRODUCTION

Sequence-to-sequence models have become popular in the automatic speech recognition (ASR) community [1, 2, 3, 4], as they allow a single neural network to jointly learn an acoustic, pronunciation and language model, greatly simplifying the ASR pipeline.
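The backward-looking attention window the abstract describes can be sketched as a mask: each output step may attend only to the most recent encoder frames, never to future ones, so the model stays online. A minimal sketch with assumed names (`streaming_attention_mask`, the `window` parameter) that are ours, not the paper's:

```python
import numpy as np

def streaming_attention_mask(n_frames, window):
    """Boolean mask of shape (n_frames, n_frames): position i may attend
    to position j iff j is one of the `window` most recent frames up to
    and including i. No entry above the diagonal is True, so attention
    never depends on future frames (the model remains streamable)."""
    idx = np.arange(n_frames)
    in_past = idx[None, :] <= idx[:, None]           # j <= i
    in_window = idx[None, :] > idx[:, None] - window  # j > i - window
    return in_past & in_window

# With window=3, frame 5 may attend to frames 3, 4, and 5 only.
mask = streaming_attention_mask(6, window=3)
print(mask.astype(int))
```

Widening `window` (looking further back in time) recovers more context for attention, which is the lever the paper uses to close part of the gap to the full-attention LAS model.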